WIP: feat: added logic to handle cert tnf cert rotation #1493

eggfoobar · 2025-10-07T21:34:13Z

added a handler func during event watch to trigger a job on both nodes to restart the podman-etcd service

coderabbitai · 2025-10-07T21:34:40Z

Walkthrough

Adds a restart-etcd command to the setup runner; introduces an operator Secret event handler to detect etcd cert changes and trigger restart jobs; exposes a PCS API to restart etcd; extends job tooling with a restart-etcd type and timeout; and adds a runner that invokes the PCS restart.

Changes

Cohort / File(s)	Summary
CLI command wiring `cmd/tnf-setup-runner/main.go`	Imports restart-etcd package and registers a new Cobra command via `NewRestartEtcdCommand` that runs `tnfrestartetcd.RunEtcdRestart`, exiting fatally on error.
Operator: Etcd cert change handler `pkg/tnf/operator/starter.go`	Adds Secret informer handler `handleEtcdCertChange` to detect changes in `tlshelpers.EtcdAllCertsSecretName`; serializes/debounces restarts via mutex/flag; lists control-plane nodes, starts `JobTypeRestartEtcd` jobs, waits for completion, then deletes preceding fencing jobs; integrates into Secrets Add/Update/Delete wiring; adds `tlshelpers` import and new globals (`etcdCertUpdateTriggered`, `etcdCertUpdateMutex`).
PCS: Etcd restart API `pkg/tnf/pkg/pcs/etcd.go`	Adds exported `RestartEtcd(ctx context.Context) error` to check PCS resource status and restart the etcd resource if present, returning errors encountered.
Tools: Jobs types/timeouts `pkg/tnf/pkg/tools/jobs.go`	Adds `RestartEtcdJobCompletedTimeout` constant and `JobTypeRestartEtcd` enum value; updates `JobType.GetSubCommand()` to return `"restart-etcd"` for this job type.
Runner: restart-etcd `pkg/tnf/restart-etcd/runner.go`	Adds `RunEtcdRestart()` which creates a background context, logs start/end, calls `pcs.RestartEtcd(ctx)`, and returns any error.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title accurately reflects the main functionality—adding logic for TNF certificate rotation—but the “WIP:” and “feat:” prefixes introduce unnecessary noise and do not contribute to clarity.
Description Check	✅ Passed	The description succinctly explains that a handler function was added to trigger a podman-etcd restart job on both nodes, which directly corresponds to the secret-watch and restart logic implemented in the pull request.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between e4b1946 and 00d5025.

📒 Files selected for processing (5)

cmd/tnf-setup-runner/main.go (3 hunks)
pkg/tnf/operator/starter.go (4 hunks)
pkg/tnf/pkg/pcs/etcd.go (1 hunks)
pkg/tnf/pkg/tools/jobs.go (3 hunks)
pkg/tnf/restart-etcd/runner.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/tnf/pkg/pcs/etcd.go

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2025-10-07T21:35:03Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: eggfoobar
Once this PR has been reviewed and has the lgtm label, please assign mshitrit for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

cmd/tnf-setup-runner/OWNERS
pkg/tnf/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between e06c84b and e4b1946.

📒 Files selected for processing (5)

cmd/tnf-setup-runner/main.go (3 hunks)
pkg/tnf/operator/starter.go (4 hunks)
pkg/tnf/pkg/pcs/etcd.go (1 hunks)
pkg/tnf/pkg/tools/jobs.go (3 hunks)
pkg/tnf/restart-etcd/runner.go (1 hunks)

coderabbitai · 2025-10-07T21:39:02Z

pkg/tnf/operator/starter.go

+	for _, node := range nodeList {
+		runJobController(ctx, tools.JobTypeRestartEtcd, &node.Name, controllerContext, operatorClient, client, kubeInformersForNamespaces)
+	}


⚠️ Potential issue | 🟠 Major

Avoid spawning restart job controllers on every secret event

runJobController spins up a long-lived controller (go ...Run(ctx, 1)). Calling it inside the cert-change handler means every secret update (and even re-list events) will start another controller instance per node, leading to unbounded goroutines and duplicated informers fighting over the same tnf-restart-etcd-job-* resources. Please start these controllers once (e.g. alongside the other TNF job controllers) and let the handler only manage job lifecycle triggers.

eggfoobar · 2025-10-07T22:10:47Z

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-etcd-certrotation

openshift-ci · 2025-10-07T22:10:50Z

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-etcd-certrotation

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7ace3120-a3ca-11f0-859f-c3a8764c54a7-0

added a handler func during event watch to trigger a job on both nodes to restart the podman-etcd service Signed-off-by: ehila <[email protected]>

openshift-ci · 2025-10-08T09:40:56Z

@eggfoobar: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-vsphere-ovn-etcd-scaling	`e4b1946`	link	false	`/test e2e-vsphere-ovn-etcd-scaling`
ci/prow/e2e-aws-disruptive-ovn	`e4b1946`	link	false	`/test e2e-aws-disruptive-ovn`
ci/prow/verify	`00d5025`	link	true	`/test verify`
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown	`e4b1946`	link	false	`/test e2e-metal-ovn-sno-cert-rotation-shutdown`
ci/prow/e2e-azure-ovn-etcd-scaling	`e4b1946`	link	false	`/test e2e-azure-ovn-etcd-scaling`
ci/prow/e2e-aws-disruptive	`e4b1946`	link	false	`/test e2e-aws-disruptive`
ci/prow/e2e-metal-ovn-two-node-fencing	`e4b1946`	link	false	`/test e2e-metal-ovn-two-node-fencing`
ci/prow/e2e-gcp-disruptive	`e4b1946`	link	false	`/test e2e-gcp-disruptive`
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown	`e4b1946`	link	false	`/test e2e-metal-ovn-ha-cert-rotation-shutdown`
ci/prow/e2e-aws-ovn-etcd-scaling	`e4b1946`	link	false	`/test e2e-aws-ovn-etcd-scaling`
ci/prow/e2e-gcp-disruptive-ovn	`e4b1946`	link	false	`/test e2e-gcp-disruptive-ovn`
ci/prow/e2e-aws-etcd-certrotation	`e4b1946`	link	false	`/test e2e-aws-etcd-certrotation`
ci/prow/e2e-metal-assisted	`e4b1946`	link	false	`/test e2e-metal-assisted`
ci/prow/e2e-aws-cpms	`e4b1946`	link	false	`/test e2e-aws-cpms`
ci/prow/okd-scos-e2e-aws-ovn	`00d5025`	link	false	`/test okd-scos-e2e-aws-ovn`
ci/prow/e2e-aws-etcd-recovery	`e4b1946`	link	false	`/test e2e-aws-etcd-recovery`
ci/prow/e2e-gcp-ovn-etcd-scaling	`e4b1946`	link	false	`/test e2e-gcp-ovn-etcd-scaling`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

slintes

I'm not sure if this can work, because AFAIK there is a delay between the cert Secret being updated, and the actual files on the nodes being synced. I'm afraid the podman-etcd restart happens too early.
Also, I tried to create a cluster with this patch. Looks like it runs the new restart jobs too early, when the pacemaker cluster wasn't setup yet, which fails cluster creation:

$ k get job,pod
NAME                                      STATUS     COMPLETIONS   DURATION   AGE
job.batch/tnf-after-setup-job-master-0    Complete   1/1           4m35s      20h
job.batch/tnf-after-setup-job-master-1    Complete   1/1           4m34s      20h
job.batch/tnf-auth-job-master-0           Complete   1/1           6s         20h
job.batch/tnf-auth-job-master-1           Complete   1/1           7s         20h
job.batch/tnf-fencing-job                 Complete   1/1           4m31s      20h
job.batch/tnf-restart-etcd-job-master-0   Failed     0/1           20h        20h
job.batch/tnf-restart-etcd-job-master-1   Failed     0/1           20h        20h
job.batch/tnf-setup-job                   Complete   1/1           4m21s      20h

NAME                                      READY   STATUS      RESTARTS   AGE
pod/etcd-guard-master-0                   1/1     Running     0          20h
pod/etcd-guard-master-1                   1/1     Running     0          20h
pod/etcd-master-0                         4/4     Running     0          20h
pod/etcd-master-1                         4/4     Running     0          20h
pod/installer-3-master-1                  0/1     Completed   0          20h
pod/installer-5-master-0                  0/1     Completed   0          20h
pod/installer-5-master-1                  0/1     Completed   0          20h
pod/installer-6-master-0                  0/1     Completed   0          20h
pod/installer-6-master-1                  0/1     Completed   0          20h
pod/installer-7-master-0                  0/1     Completed   0          20h
pod/installer-7-master-1                  0/1     Completed   0          20h
pod/revision-pruner-7-master-0            0/1     Completed   0          20h
pod/revision-pruner-7-master-1            0/1     Completed   0          20h
pod/tnf-after-setup-job-master-0-8btwt    0/1     Completed   0          20h
pod/tnf-after-setup-job-master-1-h4k9r    0/1     Completed   0          20h
pod/tnf-auth-job-master-0-fwr26           0/1     Completed   0          20h
pod/tnf-auth-job-master-1-4z72f           0/1     Completed   0          20h
pod/tnf-fencing-job-8jzgf                 0/1     Completed   0          20h
pod/tnf-restart-etcd-job-master-0-2zzcw   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-0-4rpwm   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-0-5lqzp   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-0-lld2c   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-1-2c6dp   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-1-7vx2c   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-1-pssr6   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-1-tqhks   0/1     Error       0          20h
pod/tnf-setup-job-d9prb                   0/1     Completed   0          20h

$ k logs tnf-restart-etcd-job-master-0-2zzcw 
I1015 13:13:41.793135   14200 runner.go:12] Running TNF etcd restart
I1015 13:13:41.793260   14200 etcd.go:37] Checking pcs resources
I1015 13:13:41.793298   14200 exec.go:24] Executing: /usr/bin/nsenter -a -t 1 /bin/bash -c /usr/sbin/pcs resource status
I1015 13:13:43.190362   14200 exec.go:38]   stdout: 
I1015 13:13:43.190670   14200 exec.go:39]   stderr: Error: unable to get cluster status from crm_mon
crm_mon: Connection to cluster failed: Connection refused
I1015 13:13:43.190726   14200 exec.go:41]   err: exit status 1
E1015 13:13:43.190744   14200 etcd.go:41] exit status 1Failed to get pcs resource statusstdoutstderrError: unable to get cluster status from crm_mon
crm_mon: Connection to cluster failed: Connection refused
errexit status 1
F1015 13:13:43.190795   14200 main.go:126] exit status 1

openshift-merge-robot · 2025-10-18T22:33:11Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 7, 2025

openshift-ci bot requested review from jaypoulz and slintes October 7, 2025 21:35

coderabbitai bot reviewed Oct 7, 2025

View reviewed changes

feat: added logic to handle cert tnf cert rotation

00d5025

added a handler func during event watch to trigger a job on both nodes to restart the podman-etcd service Signed-off-by: ehila <[email protected]>

eggfoobar force-pushed the add-tnf-cert-rotation-handling branch from e4b1946 to 00d5025 Compare October 8, 2025 06:10

slintes reviewed Oct 16, 2025

View reviewed changes

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

WIP: feat: added logic to handle cert tnf cert rotation #1493

WIP: feat: added logic to handle cert tnf cert rotation #1493

Uh oh!

eggfoobar commented Oct 7, 2025

Uh oh!

coderabbitai bot commented Oct 7, 2025 •

edited

Loading

Uh oh!

openshift-ci bot commented Oct 7, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Oct 7, 2025

Uh oh!

eggfoobar commented Oct 7, 2025

Uh oh!

openshift-ci bot commented Oct 7, 2025

Uh oh!

openshift-ci bot commented Oct 8, 2025

Uh oh!

slintes left a comment

Uh oh!

openshift-merge-robot commented Oct 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

WIP: feat: added logic to handle cert tnf cert rotation #1493

Are you sure you want to change the base?

WIP: feat: added logic to handle cert tnf cert rotation #1493

Uh oh!

Conversation

eggfoobar commented Oct 7, 2025

Uh oh!

coderabbitai bot commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

openshift-ci bot commented Oct 7, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 7, 2025

Choose a reason for hiding this comment

Uh oh!

eggfoobar commented Oct 7, 2025

Uh oh!

openshift-ci bot commented Oct 7, 2025

Uh oh!

openshift-ci bot commented Oct 8, 2025

Uh oh!

slintes left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-merge-robot commented Oct 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coderabbitai bot commented Oct 7, 2025 •

edited

Loading